White Wine Quality Analysis by Lee Clemmer

In this analysis I will be investigating which chemical properties influence the quality of white wines.

Univariate Plots Section

## [1] "No. of Observations and No. of Variables"
## [1] 4898   12
## [1] "Variable Names"
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
## [1] "Data Structure"
## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
## [1] "Summary of Variables"
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Wine quality ratings are distributed as integers between 3 and 9. Only 5 wines out of 4,898 were rated a 9, while only 20 received the lowest score of 3.
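As a quick check on how lopsided the ratings are, the counts in the table above can be turned into percentage shares. This is a minimal sketch that assumes only the printed counts; the names `quality_counts`, `quality_share`, and `middle_share` are mine, not from the original analysis.

```r
# Counts copied from the quality table printed above (score -> number of wines).
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

total_wines <- sum(quality_counts)  # should equal the 4898 observations

# Percentage of wines at each score, rounded to two decimals.
quality_share <- round(100 * quality_counts / total_wines, 2)
print(quality_share)

# Combined share of the three middle scores (5, 6, and 7).
middle_share <- sum(quality_counts[c("5", "6", "7")]) / total_wines
print(middle_share)
```

The three middle scores together account for about 93% of the wines, which is worth keeping in mind when reading the per-grade summaries later on.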

##            n
## 1 0.08922009

pH values are approximately normally distributed around the mean of 3.188. Almost 9% fall below a pH of 3. Since pH describes how acidic or basic something is, I wonder if there is a tight relationship between pH and the other acidity-related properties.
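The 0.0892 figure printed above is the share of wines with a pH below 3. One way such a share could be computed is as the mean of a logical vector. This is a sketch, not necessarily how the original figure was produced: `share_below` is a hypothetical helper and the pH values are made up for illustration.

```r
# Hypothetical helper: fraction of values strictly below a threshold.
share_below <- function(values, threshold) {
  mean(values < threshold)  # TRUE counts as 1, FALSE as 0
}

# Made-up pH readings for illustration only (not rows from the data set).
ph_example <- c(2.9, 3.0, 3.1, 3.2, 3.3, 2.8, 3.4, 3.2)

# Two of the eight values (2.9 and 2.8) are strictly below 3.
print(share_below(ph_example, 3))
```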

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

I increased the number of bins to get some better resolution. It looks like fixed acidity is normally distributed around the mean of 6.855, with a couple outliers beyond 11.

Volatile acidity is mostly normally distributed around the mean of 0.2782 with a slight positive skew. I wonder what the relationship between fixed and volatile acidity is.

Citric acid is normally distributed around the mean of 0.3342, with a couple of extreme outliers beyond 1.1. We can see peaks at 0.5 and 0.75; I wonder if those are common amounts of citric acid to add to wine. Again, I wonder what the relationship is between all the acidity-related variables.

The residual sugar distribution shows a peak around 2, and due to several extreme outliers most of the distribution sits on the left side of the histogram. To get a better look at the distribution patterns I applied a square root transformation to the x-axis. I would characterize the shape of the distribution as multimodal, with several peaks and valleys. The lowest such valley occurs between 3 and 4, before the distribution drops off at around 18. There was only 1 wine with greater than 45 grams/liter of sugar, which is considered sweet.
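To illustrate why a square root scale helps with a long right tail, the sketch below compares how far the maximum sits from the median before and after the transformation. It uses only the first ten residual sugar values shown in the data structure listing, so the numbers are illustrative rather than representative. `tail_ratio` is a hypothetical helper.

```r
# First ten residual.sugar values from the str() output above.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# Hypothetical helper: how many medians away the largest value lies.
tail_ratio <- function(x) max(x) / median(x)

raw_ratio  <- tail_ratio(residual_sugar)
sqrt_ratio <- tail_ratio(sqrt(residual_sugar))

# The square root compresses large values more than small ones,
# so the transformed ratio is smaller than the raw one.
print(c(raw = raw_ratio, sqrt = sqrt_ratio))
```

The same compression is what pulls the bulk of the histogram away from the left edge when the x-axis is drawn on a square root scale.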

For chlorides, the outliers in the positively skewed long tail led me to again apply a square root transformation to the x-axis to get a better sense of the shape of the bulk of the distribution. The pattern mostly follows a normal distribution around the mean of 0.04577, with a bit of a long positive tail.

## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1   868

There are 868 wines with levels of free sulfur dioxide greater than 50, at which point it becomes evident in the nose and taste of the wine. I’ve added a derived binary variable to the data set that captures whether the wine exceeds the threshold or not. I wonder what effect on quality this might have. I’ve also added another variable: free sulfur dioxide as a share of total sulfur dioxide. Perhaps the balance of free and bound forms of SO2 has an effect on quality?
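The two derived variables can be sketched in base R as follows. The column names match those that appear in the output further down (`free.sulfur.dioxide.evident` and `free.so2.pct.of.total`). The three rows are the ones printed in the density outlier listing, and the data frame name `ww_rows` is a placeholder of mine.

```r
# Three rows taken from the outlier listing later in this section.
ww_rows <- data.frame(
  free.sulfur.dioxide  = c(35, 35, 8),
  total.sulfur.dioxide = c(176, 176, 160)
)

# TRUE when free SO2 exceeds the 50 ppm threshold.
ww_rows$free.sulfur.dioxide.evident <- ww_rows$free.sulfur.dioxide > 50

# Free SO2 as a fraction of total SO2.
ww_rows$free.so2.pct.of.total <-
  ww_rows$free.sulfur.dioxide / ww_rows$total.sulfur.dioxide

print(ww_rows)
```

For these rows the ratio comes out to roughly 0.1989 and 0.05, in line with the printed values, which suggests the variable is stored as a fraction rather than as a percentage.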

Total sulfur dioxide follows a mostly normal distribution, with some outliers beyond 250. I expect a strong correlation between total sulfur dioxide and free sulfur dioxide, as the latter is a subset of the former.

Sulphates are normally distributed with a touch of positive skew. Since sulphates can contribute to sulfur dioxide gas, I expect a strong correlation between sulphates and total sulfur dioxide.

Alcohol levels fall between 8% and just over 14% alcohol by volume, with a positively skewed distribution peaking at around 9.5%.

## Warning: Removed 3 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1654           7.9            0.330        0.28           31.6     0.053
## 1664           7.9            0.330        0.28           31.6     0.053
## 2782           7.8            0.965        0.60           65.8     0.074
##      free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
## 1654                  35                  176 1.01030 3.15      0.38
## 1664                  35                  176 1.01030 3.15      0.38
## 2782                   8                  160 1.03898 3.39      0.69
##      alcohol quality free.sulfur.dioxide.evident free.so2.pct.of.total
## 1654     8.8       6                       FALSE             0.1988636
## 1664     8.8       6                       FALSE             0.1988636
## 2782    11.7       6                       FALSE             0.0500000
## Source: local data frame [1 x 1]
## 
##       n
##   (int)
## 1   937

In order to look at density levels I decided to cut off a couple of the extreme outliers. The shape is normal around the mean of 0.9940. I wonder whether humans can really detect such small variations in liquid density, and whether that would have an impact on quality.
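Trimming the extremes before plotting can be sketched as a simple filter. The cutoff of 1.005 is an assumption chosen for illustration, since the analysis does not state the limit that was actually used. The vector combines the minimum and quartiles from the summary with the three outlier densities listed above.

```r
# Min, quartiles, and the three extreme densities printed above.
density_values <- c(0.9871, 0.9917, 0.9937, 0.9961, 1.0103, 1.0103, 1.03898)

# Hypothetical cutoff; the limit used in the original plot is not stated.
density_cutoff <- 1.005

kept    <- density_values[density_values <= density_cutoff]
dropped <- density_values[density_values > density_cutoff]

print(length(dropped))  # number of values excluded from the plot
print(range(kept))      # range of the values that remain
```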

Univariate Analysis

What is the structure of your dataset?

There are 4,898 observations, each a white wine described by 11 chemical properties and rated on a scale of 0 to 10 by 3 wine experts.

The data include the following variables:

  1. fixed acidity (tartaric acid - g / dm^3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
  2. volatile acidity (acetic acid - g / dm^3): the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste
  3. citric acid (g / dm^3): found in small quantities, citric acid can add ‘freshness’ and flavor to wines
  4. residual sugar (g / dm^3): the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
  5. chlorides (sodium chloride - g / dm^3): the amount of salt in the wine
  6. free sulfur dioxide (mg / dm^3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
  7. total sulfur dioxide (mg / dm^3): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
  8. density (g / cm^3): the density of wine is close to that of water depending on the percent alcohol and sugar content
  9. pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
  10. sulphates (potassium sulphate - g / dm^3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
  11. alcohol (% by volume): the percent alcohol content of the wine
  12. quality (score between 0 and 10): Output variable (based on sensory data)

Some other observations:

  * No wine was scored below 3 nor above 9; the median was 6.
  * The median alcohol content was 10.4% by volume.
  * pH levels varied between a minimum of 2.72 and a maximum of 3.82, with the median at 3.18.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in this data set is quality. I’d like to know if any of the chemical properties are highly correlated to quality and could be used to predict which wines are going to be better than others.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

After doing the univariate analysis, I’m actually not quite sure which variables will have the biggest effect on quality. There are two variable clusters - acidity (pH, Fixed Acidity, Volatile Acidity, and Citric Acid) and sulfur dioxide (Free Sulfur Dioxide, Total Sulfur Dioxide, Sulphates) - whose members I think will be strongly correlated with one another. I wonder about the impact on quality of density and alcohol level, as these aren’t necessarily taste related.

Did you create any new variables from existing variables in the dataset?

I created two new variables. The first is based on the fact that free sulfur dioxide becomes evident in taste at a level of 50 ppm; it captures whether this taste is evident or not (TRUE/FALSE). The second expresses free sulfur dioxide as a share of total sulfur dioxide.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

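Most distributions were approximately normal, and most also had a handful of extreme outliers. The most unusual distribution was perhaps that of residual sugar, which had a large peak on the left and multiple smaller peaks. To see the shape of the skewed distributions more clearly, I applied a square root transformation to the x-axis for residual sugar and chlorides, and I cut off the most extreme density outliers.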

Bivariate Plots Section

##                             fixed.acidity volatile.acidity  citric.acid
## fixed.acidity                  1.00000000      -0.02269729  0.289180698
## volatile.acidity              -0.02269729       1.00000000 -0.149471811
## citric.acid                    0.28918070      -0.14947181  1.000000000
## residual.sugar                 0.08902070       0.06428606  0.094211624
## chlorides                      0.02308564       0.07051157  0.114364448
## free.sulfur.dioxide           -0.04939586      -0.09701194  0.094077221
## total.sulfur.dioxide           0.09106976       0.08926050  0.121130798
## density                        0.26533101       0.02711385  0.149502571
## pH                            -0.42585829      -0.03191537 -0.163748211
## sulphates                     -0.01714299      -0.03572815  0.062330940
## alcohol                       -0.12088112       0.06771794 -0.075728730
## quality                       -0.11366283      -0.19472297 -0.009209091
## free.sulfur.dioxide.evident   -0.02794808      -0.03070756  0.118821472
## free.so2.pct.of.total         -0.13945918      -0.19616085  0.016241396
##                             residual.sugar   chlorides free.sulfur.dioxide
## fixed.acidity                   0.08902070  0.02308564       -0.0493958591
## volatile.acidity                0.06428606  0.07051157       -0.0970119393
## citric.acid                     0.09421162  0.11436445        0.0940772210
## residual.sugar                  1.00000000  0.08868454        0.2990983537
## chlorides                       0.08868454  1.00000000        0.1013923521
## free.sulfur.dioxide             0.29909835  0.10139235        1.0000000000
## total.sulfur.dioxide            0.40143931  0.19891030        0.6155009650
## density                         0.83896645  0.25721132        0.2942104109
## pH                             -0.19413345 -0.09043946       -0.0006177961
## sulphates                      -0.02666437  0.01676288        0.0592172458
## alcohol                        -0.45063122 -0.36018871       -0.2501039415
## quality                        -0.09757683 -0.20993441        0.0081580671
## free.sulfur.dioxide.evident     0.24122018  0.09426740        0.7149309817
## free.so2.pct.of.total           0.05142979 -0.03321768        0.7386321024
##                             total.sulfur.dioxide     density            pH
## fixed.acidity                        0.091069756  0.26533101 -0.4258582910
## volatile.acidity                     0.089260504  0.02711385 -0.0319153683
## citric.acid                          0.121130798  0.14950257 -0.1637482114
## residual.sugar                       0.401439311  0.83896645 -0.1941334540
## chlorides                            0.198910300  0.25721132 -0.0904394560
## free.sulfur.dioxide                  0.615500965  0.29421041 -0.0006177961
## total.sulfur.dioxide                 1.000000000  0.52988132  0.0023209718
## density                              0.529881324  1.00000000 -0.0935914935
## pH                                   0.002320972 -0.09359149  1.0000000000
## sulphates                            0.134562367  0.07449315  0.1559514973
## alcohol                             -0.448892102 -0.78013762  0.1214320987
## quality                             -0.174737218 -0.30712331  0.0994272457
## free.sulfur.dioxide.evident          0.452612740  0.25679614 -0.0586619285
## free.so2.pct.of.total               -0.013447850 -0.06552475  0.0008012900
##                               sulphates     alcohol      quality
## fixed.acidity               -0.01714299 -0.12088112 -0.113662831
## volatile.acidity            -0.03572815  0.06771794 -0.194722969
## citric.acid                  0.06233094 -0.07572873 -0.009209091
## residual.sugar              -0.02666437 -0.45063122 -0.097576829
## chlorides                    0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide          0.05921725 -0.25010394  0.008158067
## total.sulfur.dioxide         0.13456237 -0.44889210 -0.174737218
## density                      0.07449315 -0.78013762 -0.307123313
## pH                           0.15595150  0.12143210  0.099427246
## sulphates                    1.00000000 -0.01743277  0.053677877
## alcohol                     -0.01743277  1.00000000  0.435574715
## quality                      0.05367788  0.43557472  1.000000000
## free.sulfur.dioxide.evident  0.04096416 -0.24623432 -0.090581598
## free.so2.pct.of.total       -0.02236186  0.06446642  0.197214077
##                             free.sulfur.dioxide.evident
## fixed.acidity                               -0.02794808
## volatile.acidity                            -0.03070756
## citric.acid                                  0.11882147
## residual.sugar                               0.24122018
## chlorides                                    0.09426740
## free.sulfur.dioxide                          0.71493098
## total.sulfur.dioxide                         0.45261274
## density                                      0.25679614
## pH                                          -0.05866193
## sulphates                                    0.04096416
## alcohol                                     -0.24623432
## quality                                     -0.09058160
## free.sulfur.dioxide.evident                  1.00000000
## free.so2.pct.of.total                        0.46986612
##                             free.so2.pct.of.total
## fixed.acidity                         -0.13945918
## volatile.acidity                      -0.19616085
## citric.acid                            0.01624140
## residual.sugar                         0.05142979
## chlorides                             -0.03321768
## free.sulfur.dioxide                    0.73863210
## total.sulfur.dioxide                  -0.01344785
## density                               -0.06552475
## pH                                     0.00080129
## sulphates                             -0.02236186
## alcohol                                0.06446642
## quality                                0.19721408
## free.sulfur.dioxide.evident            0.46986612
## free.so2.pct.of.total                  1.00000000

Some initially surprising correlations emerge from the matrix above: quality has a moderate positive correlation with alcohol (0.44) and a weak negative correlation with density (-0.31).

Let’s look at these a bit closer.

## ww$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.34   11.00   12.60 
## -------------------------------------------------------- 
## ww$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.15   10.75   13.50 
## -------------------------------------------------------- 
## ww$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## -------------------------------------------------------- 
## ww$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## ww$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## -------------------------------------------------------- 
## ww$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.64   12.60   14.00 
## -------------------------------------------------------- 
## ww$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90

In the scatter plots we can see that as quality increases, levels of alcohol tend to be higher, as also shown by the linear smoothing line. This becomes even more apparent when studying the boxplot and summarizing median alcohol levels per quality rank. Wines rated 7 and above have a median alcohol level of 11.4 or higher, while wines rated 6 and below have a median alcohol level between 9.5 and 10.5.
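The per-grade medians behind this observation can be computed with a grouped summary. Below is a minimal sketch using `tapply`. The data frame `ww_toy` is hypothetical: its stand-in rows are built from the quartile figures printed above for grades 5, 6, and 7, not from real observations.

```r
# Hypothetical stand-in rows: for each grade, the 1st quartile, median,
# and 3rd quartile of alcohol from the summaries above.
ww_toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.2, 9.5, 10.3, 9.6, 10.5, 11.4, 10.6, 11.4, 12.3)
)

# Median alcohol level within each quality grade.
median_by_quality <- tapply(ww_toy$alcohol, ww_toy$quality, median)
print(median_by_quality)
```

On the full data set the same call would take the real alcohol and quality columns in place of the stand-in rows.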

Let’s take a look at quality vs. density.

## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).

## Warning: Removed 3 rows containing non-finite values (stat_boxplot).

## ww$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9911  0.9925  0.9944  0.9949  0.9969  1.0000 
## -------------------------------------------------------- 
## ww$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9892  0.9926  0.9941  0.9943  0.9958  1.0000 
## -------------------------------------------------------- 
## ww$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9872  0.9933  0.9953  0.9953  0.9972  1.0020 
## -------------------------------------------------------- 
## ww$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9876  0.9917  0.9937  0.9940  0.9959  1.0390 
## -------------------------------------------------------- 
## ww$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9906  0.9918  0.9925  0.9937  1.0000 
## -------------------------------------------------------- 
## ww$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9903  0.9916  0.9922  0.9935  1.0010 
## -------------------------------------------------------- 
## ww$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9896  0.9898  0.9903  0.9915  0.9906  0.9970

When plotting density against quality, we can see the negative correlation visually. It appears that as density increases, quality decreases. This trend is particularly noticeable at grades 7, 8, and 9. When we look at the boxplot of the data, we can indeed see that the medians for 7, 8, and 9 are below those of the lower grades, which have medians between 0.9937 and 0.9953. The higher quality wines have median densities between 0.9903 and 0.9918.
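The gap between the two groups can be checked directly from the medians in the per-grade summaries above. In this small sketch the values are copied from that output, and the variable names are mine.

```r
# Median density per quality grade, copied from the summaries above.
median_density <- c("3" = 0.9944, "4" = 0.9941, "5" = 0.9953, "6" = 0.9937,
                    "7" = 0.9918, "8" = 0.9916, "9" = 0.9903)
grade <- as.integer(names(median_density))

low_range  <- range(median_density[grade <= 6])  # grades 3 to 6
high_range <- range(median_density[grade >= 7])  # grades 7 to 9

print(low_range)
print(high_range)

# TRUE when every high-grade median lies below every low-grade median.
print(max(high_range) < min(low_range))
```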

This is surprising! I wouldn’t have guessed that density would be one of the better-correlated variables. However, we know that “the density of wine is close to that of water depending on the percent alcohol and sugar content”. And in fact this is exactly what the data bear out.

Let’s take a closer look at density. It has a strong positive correlation with residual sugar (0.84, the strongest correlation found between any two of the original variables) and a strong negative correlation with alcohol (-0.78).

## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).

## Warning: Removed 1451 rows containing non-finite values (stat_smooth).
## Warning: Removed 1466 rows containing missing values (geom_point).

We can see a clear relationship between density and residual sugar: as residual sugar increases, so does density. As the description of the dataset indicated, residual sugars do indeed rarely go lower than 1, as indicated by the dotted red line. We also noticed, as hinted at by the histogram of residual sugar, that a large cluster of wines have sugar levels between 1 and 2.

## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).

We clearly see that as the alcohol levels increase, density decreases. Let’s see what the relationship looks like between alcohol and sugars.

## Warning: Removed 18 rows containing non-finite values (stat_smooth).
## Warning: Removed 18 rows containing missing values (geom_point).

As expected, alcohol and sugar have a negative correlation: the more sugar is left after fermentation, the less alcoholic the wine. I assume this is because the sugar has not been converted to alcohol, and therefore the wine is less alcoholic and sweeter.


The relationship between the presence of sulfur dioxide (SO2) and the quality of wine is a bit murky. If we consider free SO2 as a share of total SO2, we find a weak positive correlation (0.20), and for total SO2 we find a weak negative correlation (-0.17). In other words, less total SO2 is better, and a smaller share of bound (not free) SO2 is better for wine quality.

## Warning: Removed 18 rows containing non-finite values (stat_smooth).
## Warning: Removed 18 rows containing missing values (geom_point).

Among the acidity-related features, volatile acidity has the strongest correlation with quality, albeit a weak negative one (-0.19): the more volatile acidity is present, the lower the quality. As was mentioned in the feature description, with too much of this acidity the wine begins to take on a vinegar taste.

## Warning: Removed 110 rows containing non-finite values (stat_smooth).
## Warning: Removed 110 rows containing missing values (geom_point).

Finally, chlorides are also weakly negatively correlated with quality (-0.21); the more chlorides are in the wine, the worse the quality.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Among the original variables, the following were most strongly correlated with quality, in descending order of absolute correlation strength: alcohol (0.44), density (-0.31), chlorides (-0.21), volatile acidity (-0.19), and total sulfur dioxide (-0.17). I was surprised both that alcohol had the strongest correlation (I wouldn’t think this alone would say anything about quality), and that residual sugar had so little correlation (-0.10), since it was so strongly correlated with both alcohol and density. I had also expected either the presence of sulfur dioxide or acidity to have a greater correlation, but as it stands each only has a weak correlation with quality. I wonder if together these features could build a robust linear regression model with good predictive power.
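The ranking by absolute correlation can be reproduced from the quality column of the correlation matrix printed in the Bivariate Plots Section. This is a short sketch: the values are copied from that matrix, while the vector names are mine.

```r
# Correlation of each variable with quality, copied from the matrix above.
quality_cor <- c(
  fixed.acidity               = -0.113662831,
  volatile.acidity            = -0.194722969,
  citric.acid                 = -0.009209091,
  residual.sugar              = -0.097576829,
  chlorides                   = -0.209934411,
  free.sulfur.dioxide         =  0.008158067,
  total.sulfur.dioxide        = -0.174737218,
  density                     = -0.307123313,
  pH                          =  0.099427246,
  sulphates                   =  0.053677877,
  alcohol                     =  0.435574715,
  free.sulfur.dioxide.evident = -0.090581598,
  free.so2.pct.of.total       =  0.197214077
)

# Sort by absolute value, strongest first, keeping the sign for reading.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(round(ranked, 2))
```

Sorting on the absolute value keeps negative and positive relationships on the same footing, so density ranks second even though its sign is opposite to that of alcohol.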

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I was able to confirm the relationships between some features as they were described. For example, the density of wine was strongly correlated with both alcohol and residual sugar.

What was the strongest relationship you found?

The strongest correlation I found was between residual sugar and density at 0.84. The more sugar is left after fermentation, the higher the density of the wine.

Multivariate Plots Section

## Warning: Removed 18 rows containing missing values (geom_point).

Exploring the relationship between sugar, density, and alcohol a bit further, we can see the three features interact in the above plot. What we see is that the variation in density as sugar levels increase is explained neatly by the alcohol content: the higher the alcohol content, the lower the density, at all points on the sugar level spectrum.
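One way to express the idea that alcohol accounts for the spread in density at every sugar level is a linear model with both predictors. The sketch below fits such a model to simulated data. The coefficients, noise level, and value ranges are assumptions for illustration only and are not estimates from the wine data.

```r
set.seed(42)
n <- 200

# Simulated, hypothetical predictors spanning plausible ranges.
sim <- data.frame(
  residual.sugar = runif(n, min = 1, max = 20),
  alcohol        = runif(n, min = 8, max = 14)
)

# Illustrative relationship: density rises with sugar, falls with alcohol.
sim$density <- 1.0 + 0.0004 * sim$residual.sugar -
  0.0012 * sim$alcohol + rnorm(n, sd = 0.0005)

# Fit density on both predictors at once.
fit   <- lm(density ~ residual.sugar + alcohol, data = sim)
coefs <- coef(fit)
print(round(coefs, 5))
```

A positive sugar coefficient alongside a negative alcohol coefficient is the pattern the plot suggests. Fitting the same formula to the real data would show whether it holds there.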

## Warning: Removed 3 rows containing missing values (geom_point).

Studying the effect of alcohol and density on quality in the grid of plots above, we notice that the distribution of wines shifts from top left (lower alcohol, higher density) to bottom right (higher alcohol, lower density) as quality increases.

## Warning: Removed 18 rows containing missing values (geom_point).

Taking a look at the same plot grid but with sugar instead, we notice that as quality increases, sugar levels drop.

## Warning: Removed 9 rows containing missing values (geom_point).

Studying the effect of total sulfur dioxide, we see that the weight of the distribution shifts from right (more SO2) to left (less SO2), indicating again that quality goes down with increasing levels of sulfur dioxide.

## Warning: Removed 39 rows containing non-finite values (stat_smooth).
## Warning: Removed 65 rows containing missing values (geom_point).

Finally, taking a look at the interaction of some of the acidity features, we find that citric acid and fixed acidity have a weak positive correlation (0.29). We also see several bands along citric acid values of 0.5 and 0.75, corresponding to the peaks we saw in our citric acid histogram. The pH coloring of the plot reveals, unsurprisingly, that the more acidic the wine, the lower the pH level.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Through multivariate analysis we were able to underline the important relationships between levels of alcohol, residual sugar, and density, and their effect on the quality of the wine. Wines that are less dense, more alcoholic, and have less sugar tend to be rated higher. We were also able to verify again that higher levels of total sulfur dioxide tend to go with lower wine quality.

Were there any interesting or surprising interactions between features?

The only surprise was that there weren’t stronger correlations with wine quality. Sulphates generally had no discernible impact. The SO2 and acidity features had weak effects on quality. No unusual relationships were found that hadn’t already been hinted at in the feature descriptions.


Final Plots and Summary

Plot One

## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).

Description One

We can see a clear relationship between density and residual sugar: as residual sugar increases, so does density. As the description of the dataset indicated, residual sugars do indeed rarely go lower than 1, as indicated by the dotted red line. We also noticed, as hinted at by the histogram of residual sugar, that a large cluster of wines have sugar levels between 1 and 2.

Plot Two

## Warning: Removed 18 rows containing missing values (geom_point).

Description Two

Exploring the relationship between sugar, density, and alcohol a bit further, we can see the three features interact in the above plot. What we see is that the variation in density as sugar levels increase is explained neatly by the alcohol content: the higher the alcohol content, the lower the density, at all points on the sugar level spectrum.

Plot Three

## Warning: Removed 9 rows containing missing values (geom_point).

Description Three

The effect of total sulfur dioxide and density on quality: we see that the weight of the distribution shifts from right (more SO2) to left (less SO2), indicating that quality goes down with increasing levels of total sulfur dioxide.


Reflection

I started my investigation of nearly 5000 white wines by studying the description of the various features. In them lay some hints about the relationship of the variables that I was able to confirm over the course of the analysis. Without any real domain knowledge, I was expecting to find that the features describing various levels of acidity, the presence of sulfur dioxide, and chlorides would have the greatest impact on the level of quality. In the end, however, I was surprised to find out that it was really the density of the wine and the relationship between density, alcohol, and residual sugar that had the greatest effect on the quality of the wine.

More broadly speaking, it was a valuable exercise in diving into a dataset without any prior knowledge and getting to know its ins and outs through exploration. The clear next step would be to start developing predictive models that could estimate the quality of a wine from the values of its various features.

Reference

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.

Available at:

  * [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
  * [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
  * [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib